This guide is written not for search users, but for search service administrators who create and configure a search service. Although we try to use explicit terminology as much as possible, sometimes the word "user" will be used instead of "search service administrator" when the meaning is clear from the context.
locust consists of a multi-threaded spidering program (spider) that downloads documents from the internet, a merging tool that converts and merges spider journals into fast searchable reverse indexes (deltamerge), a tool to manage and query the database (stortool), a search daemon (searchd), a CGI search frontend producing template-based HTML output or an XML document, and auxiliary tools. The following is a general description of the locust parts and their interaction.
The spider downloads documents from the internet, parses them for keywords and links, then stores document attributes and compressed documents themselves in MySQL tables and writes the spider journal. The spider journal contains records relating documents with keywords, links and document modification times. To increase performance, the spider uses threads to run multiple download workers simultaneously.
The merging tool processes the spider journal and creates a reverse index allowing fast search of documents with keywords. In the case of respidering, the merging tool merges the old reverse index with the spider journal that contains only differences between the previous and current state of documents on the internet.
The search daemon receives query terms and query parameters from a front end CGI via a UNIX socket, finds documents satisfying the query conditions, ranks them and sends the resulting top documents back to the frontend.
The frontend CGI displays the search form and search parameter controls. When a query is submitted, the frontend sends query terms and query parameters to the search daemon via a UNIX socket. After the search daemon completes its work, the frontend receives the search results. The HTML frontend then formats the results into an HTML page according to configuration templates. The XML frontend sends the results to the client in an XML format.
The storage tool is used to create, delete and clean indexes, compute index statistics and so on.
locust also includes a suite of auxiliary tools that monitor the search availability, display spidering results, report broken links and analyze search user behavior. We will describe the auxiliary tools in a separate chapter.
In order to avoid problems with file access permissions, we advise running all locust executables under a single account.
The command lcreatedb creates an empty index database accessible using the MySQL user account locust as follows:
lcreatedb indexname
The command will ask you for the MySQL root password.
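For example, assuming a hypothetical index named isn:

lcreatedb isn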
If for some reason a different account is required, the search administrator can edit lcreatedb, which is actually a shell script.
The MySQL user account (not to be confused with the UNIX user account under which locust is run) is created with the following MySQL statement, executed under the root account via, for example, the mysql interactive client:
grant usage on *.* to account@localhost identified by 'password';
where account is the desired account name and the password is the password to be set for this account.
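For example, to create the account locust (the password shown is only a placeholder and should be replaced with one of your own):

grant usage on *.* to locust@localhost identified by 'secret';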
The spider is started by the command
spider [-N number] indexname
where number is the number of simultaneously running download threads and indexname is the name of an index database. If -N is omitted, the spider runs with one downloader.
To cleanly stop the spider, the command
spider -E indexname
is used.
The spider reads the configuration files sets.cnf, spider.cnf and storage.cnf in the index configuration directory named indexname. The spider writes errors to the log file named indexname_spider.log.
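For example, to run the spider with eight download threads for a hypothetical index named isn, and later to stop it cleanly:

spider -N 8 isn
spider -E isn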
After the spider has finished its work, the merging tool is started by the command
deltamerge indexname
where indexname is the name of an index database.
The merging tool builds the files storing ordered reverse index of keywords, ordered forward and backward links and last document modification times. It also computes page ranking based on link popularity of documents. When the merging tool finishes merging a file, it erases the corresponding spider journal data.
If the merging tool fails, it can be run again, although there is a small probability that the data will be corrupted.
After the reverse index is produced, the database is ready for search and a search daemon can be started with the command
/usr/local/sbin/asearchd -RD indexname
where indexname is the name of an index database. The options R and D mean that the program should be daemonized and restarted if it crashes. Note that, for security reasons, the locust search daemon cannot be run under a privileged (root) account.
The search daemon can run during respidering and merging. It will still produce useful results, although there may be some minor inconsistencies between the index, the stored documents and the documents online.
The search daemon reads the configuration files searchd.cnf and storage.cnf in the index configuration directory named indexname. The search daemon writes errors to the log file named indexname_searchd.log.
The search daemon ranks documents according to the document "weight," a number between 0 and 1, where 1 corresponds to the most relevant documents. This number is computed from several partial weights (the weights of the title, the meta keywords, the meta description, the font size, the number of word occurrences and their distance from the document beginning, the degree of clustering of the query terms, and link popularity) according to their shares in the total weight specified in the searchd.cnf configuration file. The computation of the partial weights will be described in a separate document.
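As a rough sketch (the partial weight symbols w_title, w_kwd and so on are introduced here only for illustration), the total weight can be thought of as

weight = (TitleWeight * w_title + MetakwdWeight * w_kwd + MetadescrWeight * w_descr + FontsizeWeight * w_font + DistWeight * w_dist + ClusterWeight * w_cluster + RankWeight * w_rank) / 100

where each partial weight lies between 0 and 1 and the share parameters (TitleWeight, MetakwdWeight and so on) are the ranking weight shares set in searchd.cnf and described later in this guide.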
The locust search frontend is a CGI program that is executed by a web server. The frontend gets its input from the following sources: the CGI option string, cookies and the frontend configuration file.
Some of the frontend options can be specified in more than one source. In this case, CGI options have the first priority, followed by cookies and then the configuration file.
The HTML frontend uses the search results, input options and other necessary information to create an HTML page that displays the search results and contains links and HTML form elements allowing the user to continue the search and to navigate within the search results. The XML frontend outputs the same information, minus formatting and form elements, in the form of an XML document. If the search fails, the frontend outputs the error information.
Search results are split into pages, with each page containing the same number of results. Only one page at a time is displayed. The HTML frontend provides links to other result pages, grouped into the result page navigation bar that can be placed above and/or below the search results. The number of results per page and the number of result page links in the navigator are configurable.
Depending on input options, the frontend produces the following output types.
Any name suitable for a CGI can be selected for a frontend. A hard link with this name must be made to the frontend executable, named shtml.cgi for an HTML frontend or sxml.cgi for an XML frontend. The hard links are placed in a CGI directory of an HTTP server (see 3. Files and directories). There can be several differently configured frontends for one index.
The CGI option string is anything that follows the first ? character in the URL. When parameters are passed in the URL, they should be encoded according to RFC 1738, as described in the CGI Developer's Guide. If values come from HTML input fields, they are automatically encoded by the browser. In particular, spaces are changed to + or %20 and special characters are replaced with the %xx hexadecimal encoding. The options are separated with & signs and have the form <name>=<value>. In this document, values are shown in non-encoded form. The following is a description of the frontend options.
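For example, a query for the words nuclear security with the tl option switched on might be encoded in a URL as follows (the frontend name example.cgi is used here only for illustration):

http://isn-search.ethz.ch/cgi-bin/example.cgi?q=nuclear+security&tl=on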
Parameters bd, ds, kw and tl can be used together in any combination. The only valid value is on. Any other value is ignored.
The storage tool is started by the command
stortool [-S | -C | -P url] [-w] index
where index is the name of an index database. The option descriptions follow:
In order to integrate an index database geared for search with a (multi-)hierarchical directory (a tree of categories), the index works with a collection of sets of documents, each identified by a unique integer. These sets then serve as leaves of a (multi-)hierarchical tree of categories, described in the config file dir.cnf, which is used by the search daemon to restrict search results to a particular category.
The file sets.cnf defines the sets of documents to be indexed. The basic building block is a set of documents residing on one server, called a server set. Server sets residing on one or several servers are combined into a document set. Document sets can be further organized into a multi-hierarchical tree of categories.
The main unit of sets.cnf is a server set description. For example,
Server http://www.isn.ethz.ch {
    Folder /researchpub/publihouse/
    Allow "/pubs/ph/details\.cfm.*" NoIndex
    NoFolder /dossiers/terrorism/pubs/
    Disallow ".*output_print\.cfm.*"
    Start /php/collections/coll_overview.htm
    Allow /php/collections/
    Document /docu/update/2003/12-december/e1217a.htm
}
Here is a description of server set specification statements.
Note that the canonical server name should not end in '/'. The statements Disallow and Allow take a regular expression as their argument. Matching of regular expressions is case insensitive. Regular expressions containing spaces must be double-quoted; in this case, some characters contained in them must be escaped with a backslash according to the rules used for C strings. The statements Start, Folder and NoFolder take a fixed server path as their argument. Indexing starts from one or more explicit (Start) and/or implicit (Folder) starting paths. For example, to start indexing from the root path, the statement Start / must be present.
Normally, a document must belong to only one server set. The spider checks this and gives a warning message if a document belongs to more than one server set.
Docset "Research Institutions" 1024 { // International Security Network Server http://www.isn.ethz.ch // Center for Nonproliferation Studies, Monterey Institute of // International Studies Server http://cns.miis.edu // Center for Peace, Conversion and Foreign Policy of Ukraine Server http://www.cpcfpu.org.ua { Folder /eng/ } }
Note that server sets from one server may appear in several Docsets.
The Allow and Disallow statements inside each server set are processed in the order in which they appear in the config file. If the path matches an Allow statement, it is classified as "to be indexed" and processing stops. If it matches a Disallow statement first, it is classified as "not to be indexed" and processing stops. If there is at least one Allow, Folder or Document statement in a category, the statement "Disallow .*" is implicitly added at the end; otherwise, "Allow .*" is added.
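For example, in the following hypothetical server set (the server name and paths are illustrative), a path such as /pubs/details.cfm?print=1 matches the Disallow statement first and is excluded, other paths under /pubs/ match the Allow statement and are indexed, and, because an Allow statement is present, "Disallow .*" is implicitly appended, so everything else is excluded:

Server http://www.example.org {
    Folder /pubs/
    Disallow ".*print=1.*"
    Allow "/pubs/.*"
}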
The following shortcuts can be used. If a server set includes all the documents that can be found on the server, the block in the curly brackets can be omitted, for example:
Server http://www.nato.int
Indexing in this case will start from the root path “/”.
Uses of links external to site
There are two distinct reasons for following outside links: the first is new server discovery, and the second is indexing the documents referenced by external links.
Using links to another website has several important applications:
Config file parameters controlling external links
The following set of config parameters allows all four cases to be treated in one framework.
OutServers <regex> { <config_statements> }
If one or more OutServers statements are present, the spider will follow outside links to servers not included in the config file, provided their names match one of the regexes following an OutServers statement. The indexing configuration for an outside server is taken from the first matching OutServers block. If there are no OutServers statements, links to outside servers are not followed.
How exactly the outside links are followed is determined by the following parameter.
Examples
To index only the immediate outside links for the ISN site, the following configuration can be used.
FollowOutsideLinks yes
Server www.isn.ethz.ch {
    Start /
}
OutServers ".*" {
    Allow / NoFollow
}
To index only the first page of any .ch server found, the following configuration can be used.
FollowOutsideLinks no
Server www.isn.ethz.ch {
    Start /
}
OutServers ".*\.ch" {
    Start /
    Allow / NoFollow
}
The spider has the following configuration parameters.
MaxDocsPerServer - Limits the number of documents the spider downloads from a single server. If not set, there is no limit. The limit also includes failed download attempts.
MaxDocSize - Maximum size of a downloaded document in bytes. Larger documents are truncated. The default value is 1048576 bytes (1 MB).
IgnoreSuffix - Specifies URL suffixes to be ignored on technical grounds, for example executables (.exe). The suffixes are matched case-insensitively. The most common file types to be ignored are already specified in the sample configuration file. This parameter should not be used to exclude URLs on content grounds; the appropriate place to do that is the sets.cnf file.
IgnoreRegex - Same as above, except that regular expressions are used instead of suffixes.
Regular expressions containing spaces must be double-quoted; in this case, some characters in them must be escaped with a backslash according to the rules used for C strings. There may be multiple IgnoreSuffix and IgnoreRegex statements, each containing multiple suffixes or regular expressions. Items specified in these statements are added to the set of suffixes or regular expressions used for filtering URLs.
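For example (the particular suffixes and expression are illustrative only):

IgnoreSuffix .exe .zip .iso
IgnoreRegex ".*[?&]print=on.*"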
Converter - Specifies an application to convert document types other than text or HTML, for example .pdf files. The form is as follows:
Converter mimetypein mimetypeout "application $in $out"
where mimetypein is the original document MIME type, mimetypeout is text/html or text/plain, application is the path to the converter executable together with the required command line arguments, and $in and $out are literal placeholders for the actual temporary input and output files.
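For example, a hypothetical entry for PDF documents could look like the following (the converter path and arguments are illustrative; the exact command line depends on the converter installed on your system):

Converter application/pdf text/html "/usr/bin/pdftohtml $in $out"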
User - MySQL user account used to create the index database. Normally, user account locust is used.
Passwd - MySQL password for the user specified above.
Host - Host on which MySQL server is running. Normally, localhost is used.
Stordir - Storage directory, by default /var/locdata.
WordCacheSize - Size of the cache used to reduce the number of database queries when finding the numerical handle for a word during spidering. The optimal size of the cache depends on many factors, and we cannot give an exact formula to determine it. In general, if there is enough memory, try to increase the size and see if spider performance improves. Look also at the hit ratio that the spider outputs at the end of its run. The default size is 50000.
HrefCacheSize - Size of the cache used to reduce the number of database queries when finding the numerical handle for a link during spidering. The same considerations apply as above, only the number of links in a document is usually smaller than the number of words, so a smaller size than the one used for the word cache is adequate. The default size is 10000.
DBType - Database type, must always be set to mysql.
DBUser - MySQL user account used to create the index database. Normally, user locust is used.
DBPass - MySQL password for the user specified above.
DBHost - Host on which MySQL server is running. Normally, localhost is used.
DataDir - Storage directory, by default /var/locdata.
Port - Port number used for communication with the frontend.
AllowFrom - Limits access to the search daemon. The argument can be a hostname, an IP address or an address/mask (in CIDR form). In the latter case, access is allowed from the whole network. If there is no AllowFrom statement, access is unlimited.
MinFixedPatternLength - Minimum length of the fixed part of a word when a pattern is used in a query (like someth*). Words with a shorter fixed part will be rejected with an appropriate error message. The default value is 6.
Include ucharset.cnf - Must always be present.
Include stopwords.cnf - Stopwords for all or for selected language(s) can be included.
Ranking weight shares are specified in percentage points and their sum must be 100.
TitleWeight - The title weight share.
MetakwdWeight - The meta keyword weight share.
MetadescrWeight - The meta description weight share.
FontsizeWeight - The fontsize weight share.
DistWeight - The share of the weight that depends on the number of keyword occurrences and their distances from the beginning of the document.
ClusterWeight - Clustering weight share.
RankWeight - Link popularity weight share.
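For illustration, a hypothetical set of shares summing to 100 might look as follows in searchd.cnf (the actual values should be tuned for the particular document collection):

TitleWeight 20
MetakwdWeight 5
MetadescrWeight 5
FontsizeWeight 5
DistWeight 30
ClusterWeight 15
RankWeight 20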
A frontend configuration file name is constructed by replacing the .cgi suffix in the CGI executable name by the .cnf suffix. Configuration file is located in the directory /etc/locust/frontend/. If, for example, a CGI executable is named xscorm.cgi, it will use the configuration file xscorm.cnf.
The following options are defined in the configuration file.
Templates are named snippets of HTML code that are used to generate search result pages. Search result pages can be customised by editing templates.
Templates may contain placeholders for frontend CGI variables and for other templates to be nested in them. A placeholder for a CGI variable consists of the variable name preceded by the dollar sign '$'. A placeholder for a nested template consists of the template name preceded by the prefix "$T".
Templates are stored in a file (normally with the suffix .tmpl). This file must be specified in the corresponding frontend configuration file as the argument to the parameter "templateFile" using either the absolute path or the path relative to the config directory /etc/locust.
After the search parameters passed through CGI arguments are read and the search results are received from the search daemon, the frontend CGI substitutes variable values for the placeholders. Next, nested templates are inserted in place of the corresponding placeholders located in other templates. Template nesting can be several levels deep, and the whole process can get quite involved. Finally, the first-level templates are placed in the CGI output in the correct order.
HTML page generation is further complicated by the fact that, in certain cases, more complex manipulations of the template text are needed than simple value substitution. In particular, certain CGI arguments have to be repeated in the href values in templates in the form of a CGI argument "&<var-name>=<value>", but only if the <value> is non-empty. To set up the correct initial state of checkbox and option elements, boolean and integer variables respectively should be used to output the corresponding attributes "checked" or "selected", depending on the variable value.
To perform these manipulations in a clean and general way, we introduce the concept of a slaved variable. The following line describes the slaved variable syntax:
$<var-name>:<type>[(<arg>)]
A slaved variable value is computed from the master variable <var-name> value and the argument <arg> depending on the slaved variable type <type>.
Here is the description of the types of the slaved variables and the rules for generating their values.
"&<var-name>=<var-value>",empty otherwise.
"&q=money"if the value is not empty and by the empty string otherwise.
<input type="checkbox" name="tl" value="on" $tl:C(on)>if the master variable tl have the value "on", the slave variable $tl:c(on) will be replaced by "checked", otherwise it will be empty.
<option value="20" $ps:s(20)>if the master variable ps has the value "20", the slave variable $ps:s(on) will be replaced by "selected". Otherwise it will be empty.
For better separation between generic HTML code and customization code, the frontend recognizes include operators in the form
Include <filepath>
where <filepath> is an absolute path, or a path relative to the template file directory, pointing to the file to be included.
We recommend keeping customization HTML code, as far as practical, in separate include files. In particular, we recommend using the file head.incl for code to be included in the HTML head, top.incl for code to appear at the top of the page (both to be included inside the template top) and bottom.incl for code to appear at the bottom of the page (to be included inside the template bottom).
Using the Unix link mechanism, a single copy of the customization files can be used in several installations.
Below we describe the frontend variables and templates. Most of this information will not be needed by a web designer configuring search pages to suit the look and feel of his or her site. In most cases, it will be enough to create the above-mentioned include files to customize the top and bottom templates. Some small changes in style made in other templates may also be desired. Thus, customization work does not require a complete understanding of the variables and the intricacies of HTML page generation. It is enough to understand the basic principles and to use the rest of this section as a reference when required.
The following variables are recognized by the locust frontend. The variables describing search parameters have default values (often empty) and can be assigned non-default values via the front-end configuration files, cookies and CGI options. These variables are usually named after corresponding CGI options. The variables describing search results are assigned values received by the frontend from the search daemon.
Constants
Miscellaneous
Search forms
Subset search
Document field search restrictions
Result presentation options
Properties of result document set
Individual document properties
Search statistics
Variables depending on the current query
Variables used in the result page navigation bar to form links to other search result pages
Information about document's site for the "moreurls" template
Advanced search
A template in the template file has the following form
<template-type> <template-name> { <template-body> }
where <template-type> is one of the two keywords Template or iTemplate, <template-name> is the name of the template, and <template-body> can be any snippet of HTML code with frontend variable and nested template placeholders inserted. The only difference between the two template types is that iTemplate (from "inner template") is a short one-line template whose body is substituted into other templates without the terminating newline character. This is done to preserve the readability of the HTML code produced by the frontend.
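For illustration only (the template names, variable names and HTML shown here are hypothetical and not necessarily those used by the distributed template file), a one-line inner template and a template nesting it might look like:

iTemplate hlopen { <b> }

Template extitle {
<a href="$url">$Thlopen$title</a>
}

Here $url and $title would be frontend variable placeholders, and $Thlopen is a placeholder for the nested template hlopen.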
To accommodate cases where the HTML code is conditional, for example the "Cached" link for an HTML document versus the "Text version" link for a PDF document, dynamic templates are used. They take the value of one of several templates, depending either on a condition or on whether a template value is empty. The dependence of dynamic templates on conditions and basic templates is hard-wired into the frontend, rather than being expressed in the template file. In the case of the result page navigation bar, the dynamic template "dnavbars" is generated using more complex hard-wired rules than a simple conditional substitution.
Top level templates
Single result templates repeated according to number of results
Navigator between the result pages, nested in restop and resbot
Templates nested in res0
Highlighting query words in title and excerpts
Opening and closing of excerpts
Cached document mode
Highlighting query words in a cached document (with colors)
Error and warning messages
Here are the files and directories used by locust.
/usr/local/sbin - Executables are placed here by the installation procedure.
/etc/locust - Directory for locust configuration files. Configuration directories named after specific indexes are placed here. It also contains the charset configuration, which includes the file ucharset.cnf and the directory chartabs with charset definitions, and the stopwords configuration, which includes the file stopwords.cnf and the directory stopwords with stopword definitions.
/etc/locust/frontend - Directory for the frontend configuration files; it also contains the file port_assignments, in which the ports used for communication between frontends and search daemons must be recorded.
/var/locdata - This is the location of the index databases, can be changed in the storage.cnf file.
/var/locdata/mysql - MySQL data directory, can be changed in the storage.cnf file.
/var/www/cgi-bin - Frontend CGI's shtml.cgi and sxml.cgi are placed here by the installation procedure.
/var/locust/log - Log file directory. Contains the global log file locust.log and index-specific log files with names formed as indexname_spider.log for spiders and indexname_searchd.log for search daemons, where indexname is an index database name.
/var/locust/run - Run-time file directory for status information; for example, the locwatch monitoring tool keeps information on alerts here, so that repeated alerts can be suppressed.
/etc/locust/auxtools - Auxiliary tools configuration directory.
/etc/init.d/locust - A start up script that restarts search daemons on boot-up.
Error detection, error handling and error reporting constitute a surprisingly large and non-trivial part of any serious program. The locust executables detect and report over one hundred different errors. Errors can be classified by their severity, subject, intended audience and the way an error message is delivered to its recipient. When talking about errors in this paragraph and in the rest of this chapter, we usually mean "errors, warnings and information messages."
We will not (at least not yet) describe every error here, but rather outline error treatment in locust. Most individual error messages are self-explanatory.
Currently, the error system is not completely finished. What is missing are the configuration options that will allow a search administrator to control error logging and output.
Error messages have the following general form.
<errcode> <level> [<time>] <message> <details>
where <errcode> is the numerical error code, <level> is the severity level, <time> is the current local time, <message> is the verbal content of the message, and <details> is specific information, such as a URL or a file name. In some cases, for stylistic reasons, <message> and <details> may appear in a different order. Some messages have not yet been assigned an error code, in which case the code is omitted.
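For example, a message reporting a failed host lookup might look like this (the code, level name, time format and wording are illustrative, not taken from the actual error list):

2031 ERROR [Tue Mar 16 04:12:55 2004] Cannot resolve host address www.example.org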
locust log files are located in the directory /var/locust/log. Messages concerning the starting and stopping a spider or a search daemon are written into the global log file locust.log. Otherwise messages are logged into index specific log files with the names formed as indexname_spider.log for spiders and indexname_searchd.log for search daemons, where indexname is an index database name. If an executable cannot open an index specific log file, it logs an error message in the global log file and exits.
The spider and other executables, except the search daemon, print messages to the standard output. Note: in the future, logging will become configurable, and it will be possible to block printing to the standard output completely or partially.
The monitoring tool locwatch uses email to send error messages to the email addresses specified in its configuration file. SMS messages can be sent via an email-to-SMS gateway.
If the spider is run via cron, cron can be configured to send the spider's standard output to the email addresses of interested users.
HTML and XML frontends send errors to the browser for the search service user to see.
locust messages have four different target audiences.
locust messages are divided by subject into the following groups.
Errors in the configuration files. Intended audience is search service administrators.
Lack of disk space, memory, failure opening files. Intended audience is search service administrators.
Errors resolving host IP address.
Errors when attempting to download documents. These errors are also named "extended HTTP" errors because their codes are used in the document status as an extension of HTTP status.
HTTP protocol status codes, except the 200 OK code.
Spidered site errors include robots.txt errors, missing and misconfigured links, charset specification errors, some document errors (invalid UTF-8 characters, tags not closed).
Errors occurring when storing data into the index database and reading from it.
Program inconsistency messages are addressed to locust developers.
locust includes a suite of auxiliary tools that monitor search availability, display spidering results, report broken links, and analyze user behavior. Most tools are implemented as CGIs and started via a browser based interface.
The monitoring tool locwatch is periodically executed by cron. It checks the search services specified in its configuration file and, if a service has failed, sends an alarm by email or SMS. The following crontab entry checks the search services specified in the configuration file locwatch10 every 10 minutes.
0-50/10 * * * * /usr/local/sbin/locwatch -f locwatch10
The configuration file can contain the following configuration parameters.
FailRecipients - Specify email addresses of fail message recipients.
RestoreRecipients - Specify email addresses of restore message recipients.
Server - Specifies the server name. In the following block, the server IP can be specified (to avoid frequent DNS requests), along with the search paths on the server to check and the minimum number of documents each must return.
IP - Server IP inside a server block.
Path - A path inside a server block. This parameter has two arguments. The first is a path on the server, including the search CGI and a query. The second is the minimum number of results this search must return to be considered working properly.
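The following is a minimal illustrative sketch of such a configuration (the addresses, server name, IP, query and threshold are hypothetical):

FailRecipients admin@example.org
RestoreRecipients admin@example.org
Server isn-search.ethz.ch {
    IP 192.0.2.10
    Path /cgi-bin/example.cgi?q=security 10
}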
An example of a configuration file can be found in the distribution under the path files/etc/auxtools/locwatch10
The CGI tool sitechk reports broken and misconfigured links in a particular index database. When the tool is opened, a list of index databases is presented in a drop-down menu. Select the desired database. To include redirected documents in the report, select the "Include redirected" checkbox. Then click the "Submit" button, and a list of pages containing broken or misconfigured links will appear in the browser.
The CGI tool servrep reports the number of documents in a particular index database per site and per download status. It helps to find closed, moved or non-working sites.
sestat displays a list of the most frequent search queries and search terms, compiled for a given period of time (normally one month). The tool also compiles lists of keywords that gained and lost the most popularity compared to the previous period.
logsplit helps an administrator analyze spider logs, which can be very large. It splits the log file into separate files for each error type.
When many indexes are run on one server, the organization of the search services becomes important.
In Linux, the cron scheduling facility can be used to periodically respider indexes.
For an index to be periodically respidered, create an entry in the crontab table associated with the account used to run locust. The locust installation script installs an initial crontab table with a commented-out sample entry.
To edit the table, use the command crontab -e. Inside the crontab editor, copy the sample entry, uncomment it and replace the index name with the one you desire. Set the desired respidering times using the crontab syntax.
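For example, the following hypothetical entry respiders and remerges an index named isn every Saturday night at 01:30 (chaining the spider and the merging tool with && is one possible arrangement; the sample entry installed with the distribution may differ):

30 1 * * 6 /usr/local/sbin/spider -N 8 isn && /usr/local/sbin/deltamerge isn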
The locust installation script installs an initialization script locust with a commented-out sample entry into the Linux initialization and termination scripts directory /etc/init.d. To ensure that the search services are automatically restored upon rebooting, the script must be activated with the command
chkconfig --add locust
For each instance of a search daemon, add an entry to the script /etc/init.d/locust. To do this, just copy the provided sample entry, uncomment it and replace the index name with the one you desire.
The search daemons can be manually started with the command
/etc/init.d/locust start
and stopped with the command
/etc/init.d/locust stop
A search form can be installed on any web page by inserting the following HTML code.
<form method="GET" action="http://isn-search.ethz.ch/cgi-bin/example.cgi"> <table> <tr> <td> <input type="text" name=q size=35 value=""> </td> <td> <input type="submit" value="Search"> </td> </tr> </table>
where http://isn-search.ethz.ch/cgi-bin/example.cgi is a locust frontend URL.
To pass additional CGI options in the URL of the linked search page, hidden input tags can be inserted into the form. For example, the tag
<input type="hidden" name="gr" value="on">
puts the option gr=on in the URL of the search page. This option causes the search results on the linked page to be grouped by site.